51 research outputs found

    Reliability of GPU-based heterogeneous systems


    Designs for increasing reliability while reducing energy and increasing lifetime

    In the last decades, computing technology has experienced tremendous development. For instance, transistor feature size has shrunk by half roughly every two years, consistently since Moore first stated his law. Consequently, the number of transistors and the core count per chip double with each generation. Similarly, petascale systems capable of performing more than 10^15 (a quadrillion) calculations per second have been developed, and exascale systems are predicted to become available around the year 2020. However, these developments in computer systems face a reliability wall. For instance, transistor feature sizes are becoming so small that it is easier for high-energy particles to temporarily flip the state of a memory cell from 1 to 0 or from 0 to 1. Also, even if we assume that the fault rate per transistor stays constant with scaling, the increase in total transistor and core count per chip will significantly increase the number of faults in future desktop and exascale systems. Moreover, circuit ageing is exacerbated by increased manufacturing variability and thermal stress, so the lifetime of processor structures is becoming shorter. On the other hand, because of the limited power budget of computer systems such as mobile devices, it is attractive to scale down the supply voltage; however, when the voltage is scaled beyond the safe margin, especially to ultra-low levels, the error rate increases drastically. Furthermore, new memory technologies such as NAND flash offer only a limited nominal lifetime, and once this lifetime is exceeded they cannot guarantee that data are stored correctly, leading to data retention problems. Due to these issues, reliability has become a first-class design constraint for contemporary computing, in addition to power and performance. Reliability plays an increasingly important role as computer systems process sensitive and life-critical information such as health records, financial information, power regulation and transportation. In this thesis, we present several reliability designs for detecting and correcting errors that occur in processor pipelines, L1 caches and non-volatile NAND flash memories for various reasons. We design reliability solutions to serve three main purposes. Our first goal is to improve the reliability of computer systems by detecting and correcting random and unpredictable errors such as bit flips or ageing errors. Second, we aim to reduce the energy consumption of computer systems by allowing them to operate reliably at ultra-low voltage levels. Third, we aim to increase the lifetime of new memory technologies by implementing efficient and low-cost reliability schemes.
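
    As a simple, hedged illustration of the kind of bit-flip detection this line of work builds on (not code from the thesis), a single even-parity bit is already enough to detect one flipped bit in a stored word; the word value and the flipped bit position below are arbitrary:

        // Minimal sketch: even-parity protection of one memory word.
        // Detects (but cannot correct) a single bit flip; correction needs a
        // stronger code such as SECDED ECC.
        #include <bitset>
        #include <cstdint>
        #include <iostream>

        // Compute an even-parity bit over a 32-bit word.
        static bool parity_bit(uint32_t word) {
            return std::bitset<32>(word).count() % 2 != 0;
        }

        int main() {
            uint32_t stored = 0xCAFEBABE;          // arbitrary data word
            bool parity = parity_bit(stored);      // kept alongside the data

            stored ^= (1u << 7);                   // a particle strike flips one bit

            // On read, the recomputed parity no longer matches the stored parity bit.
            std::cout << (parity_bit(stored) != parity ? "bit flip detected\n"
                                                       : "no error detected\n");
        }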

    A runtime heuristic to selectively replicate tasks for application-specific reliability targets

    In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, called App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated. This work was supported by the FI-DGR 2013 scholarship and the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and in part by the European Union (FEDER funds) under contract TIN2015-65316-P. Peer Reviewed. Postprint (author's final draft).
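
    The abstract does not spell out App_FIT's internals, so the following is only a hedged sketch of the general idea: order tasks by their estimated failure contribution (FIT) and replicate the largest contributors until the residual FIT of the unreplicated tasks meets the application's target. The Task structure, FIT values and target value are illustrative assumptions, not the paper's actual heuristic:

        // Hypothetical FIT-budget-driven selective replication; names and numbers
        // are illustrative, not the actual App_FIT implementation.
        #include <algorithm>
        #include <iostream>
        #include <vector>

        struct Task {
            int id;
            double fit;        // estimated failure contribution (FIT) of this task
            bool replicated;   // set to true when chosen for replication
        };

        // Replicate the highest-FIT tasks until the residual FIT of the
        // unreplicated tasks drops below the application-level target.
        void select_for_replication(std::vector<Task>& tasks, double target_fit) {
            std::sort(tasks.begin(), tasks.end(),
                      [](const Task& a, const Task& b) { return a.fit > b.fit; });
            double residual = 0.0;
            for (const Task& t : tasks) residual += t.fit;
            for (Task& t : tasks) {
                if (residual <= target_fit) break;   // reliability target met
                t.replicated = true;                 // replication removes this task's FIT
                residual -= t.fit;
            }
        }

        int main() {
            std::vector<Task> tasks = {{0, 5.0, false}, {1, 1.0, false},
                                       {2, 0.5, false}, {3, 4.0, false}};
            select_for_replication(tasks, 2.0);      // illustrative FIT target
            for (const Task& t : tasks)
                std::cout << "task " << t.id
                          << (t.replicated ? ": replicated\n" : ": not replicated\n");
        }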

    Circuit design of a novel adaptable and reliable L1 data cache

    This paper proposes a novel adaptable and reliable L1 data cache design (Adapcache) with the unique capability of automatically adapting itself to different supply voltage levels while providing the highest reliability. Depending on the supply voltage level, Adapcache defines three operating modes: at high supply voltages, Adapcache provides reliability through single-bit parity; in the middle range of supply voltages, Adapcache writes data to two separate cache lines simultaneously so that one line can be used for error recovery when the other line is faulty; at near-threshold supply voltages, Adapcache writes data to three separate cache lines simultaneously in order to provide the correct data through a bitwise majority voter. We design and simulate one embodiment of Adapcache as a 64-KB L1 data cache in 45-nm CMOS technology, running at 2 GHz for near-nominal supply voltages (1V-0.6V), at 900 MHz for middle supply voltages (0.6V-0.4V), and at 400 MHz for near-threshold supply voltages (0.4V-0.32V). According to our experimental results, energy, latency and cache capacity usage are all improved compared with the typical previous proposals, Triple Modular Redundancy (TMR) and Double Modular Redundancy (DMR), as well as the state-of-the-art Parichute Error Correction Code (ECC) proposal. Postprint (published version).
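
    The voltage thresholds below mirror the ranges quoted in the abstract; everything else (word width, how the extra copies are stored, the 64-bit toy words) is an assumed simplification meant only to make the three operating modes and the bitwise majority vote concrete:

        // Hedged sketch of Adapcache's mode selection and bitwise majority voting;
        // not the actual circuit, just the decision logic described in the abstract.
        #include <cstdint>
        #include <iostream>

        enum class Mode { Parity, DualCopy, TripleCopy };

        // Pick the protection mode from the supply voltage (in volts).
        Mode select_mode(double vdd) {
            if (vdd >= 0.6) return Mode::Parity;      // 1V-0.6V: single-bit parity
            if (vdd >= 0.4) return Mode::DualCopy;    // 0.6V-0.4V: second line for recovery
            return Mode::TripleCopy;                  // 0.4V-0.32V: three lines, majority vote
        }

        // Bitwise majority over three copies: each output bit is the value held
        // by at least two of the copies, so a single faulty copy is masked.
        uint64_t majority_vote(uint64_t a, uint64_t b, uint64_t c) {
            return (a & b) | (a & c) | (b & c);
        }

        int main() {
            uint64_t word  = 0x0123456789ABCDEFull;
            uint64_t copy1 = word;
            uint64_t copy2 = word ^ (1ull << 3);      // one copy corrupted by a bit flip
            uint64_t copy3 = word;
            std::cout << std::boolalpha
                      << (select_mode(0.35) == Mode::TripleCopy) << ' '
                      << (majority_vote(copy1, copy2, copy3) == word) << '\n';
        }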

    FIMSIM: A fault injection infrastructure for microarchitectural simulators

    Fault injection is a widely used approach for experiment-based dependability evaluation in which faults can be injected into the hardware, into a simulator or into the software. Simulation-based fault injection is more appealing to researchers, since it can be utilized at the early design stage of the processor. As such, it enables a preliminary analysis of the correlation between the criticality of circuit-level faults and their impact on applications. However, the lack of publicly available fault injectors for microarchitecture-level simulators places the extra burden of designing and implementing fault injectors on researchers who evaluate microarchitecture dependability. In this study, we present FIMSIM, to the best of our knowledge the first publicly available fault injection simulator at the microarchitecture level. FIMSIM is a compact tool which is capable of injecting transient, permanent, intermittent and multi-bit faults. Therefore, FIMSIM provides the opportunity to comprehensively evaluate the vulnerability of different microarchitectural structures against different fault models. Postprint (published version).
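
    As a rough illustration of what microarchitecture-level fault injection involves (this is not FIMSIM's code), the sketch below plans a transient single-bit fault: one randomly chosen bit of one randomly chosen register is flipped at a randomly chosen cycle, after which simulation simply continues. The register-file size, cycle count and RNG seed are arbitrary:

        // Illustrative transient single-bit fault injection into a toy register file.
        #include <array>
        #include <cstdint>
        #include <iostream>
        #include <random>

        struct InjectionPlan {
            uint64_t cycle;   // when to inject
            unsigned reg;     // which register to corrupt
            unsigned bit;     // which bit to flip (transient, single-bit fault model)
        };

        int main() {
            std::mt19937_64 rng(42);
            std::array<uint64_t, 32> regfile{};              // toy architectural register file
            const uint64_t sim_cycles = 1000;

            InjectionPlan plan{
                std::uniform_int_distribution<uint64_t>(0, sim_cycles - 1)(rng),
                static_cast<unsigned>(rng() % regfile.size()),
                static_cast<unsigned>(rng() % 64)};

            for (uint64_t cycle = 0; cycle < sim_cycles; ++cycle) {
                // ... the simulator's normal per-cycle work would go here ...
                if (cycle == plan.cycle)
                    regfile[plan.reg] ^= (1ull << plan.bit); // flip exactly one bit, once
            }
            std::cout << "injected bit " << plan.bit << " of r" << plan.reg
                      << " at cycle " << plan.cycle << '\n';
        }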

    Designing and modelling selective replication for fault-tolerant HPC applications

    Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs, but few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that covers both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user. This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and by FEDER funds under contract TIN2015-65316-P. Peer Reviewed. Postprint (author's final draft).
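
    A minimal sketch of the duplicate-and-compare idea behind detecting SDCs through replication is given below; the paper's runtime, its task selection and its fail-stop handling are not modelled, and run_replicated is a hypothetical helper name:

        // Hedged sketch: execute a selected task twice and compare the outputs; a
        // mismatch signals a silent data corruption, and a third run breaks the tie.
        #include <functional>
        #include <iostream>

        template <typename T>
        T run_replicated(const std::function<T()>& task) {
            T first  = task();
            T second = task();
            if (first == second) return first;       // outputs agree: no SDC observed
            T third = task();                        // disagreement: one replica was hit
            return (third == first) ? first : second;
        }

        int main() {
            auto task = [] { return 21 * 2; };       // stand-in for a task body
            std::cout << run_replicated<int>(task) << '\n';
        }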

    FaulTM: Fault-tolerance using hardware transactional memory

    Fault tolerance has become an essential concern for processor designers due to increasing soft-error rates. In this study, we are motivated by the fact that Transactional Memory (TM) hardware provides an ideal base upon which to build a fault-tolerant system. We show how it is possible to provide low-cost fault tolerance for serial programs by using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. This scheme, called FaulTM, employs a hybrid hardware-software fault-tolerance technique. On the software side, the FaulTM programming model provides the flexibility for programmers to decide between performance and reliability. Our experimental results indicate that FaulTM incurs relatively little performance overhead by reducing the number of comparisons and by leveraging already-proposed TM hardware. We also conduct experiments which indicate that the baseline FaulTM design has good error coverage. To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory. Peer Reviewed. Postprint (published version).
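
    Conceptually, FaulTM compares the outcomes of redundant executions before a transaction is allowed to commit, with lazy data versioning providing rollback for free on a mismatch. The sketch below imitates that flow in plain software, with a buffered write set standing in for the hardware's versioning; it is an assumed simplification, not the proposed hardware design:

        // Software caricature of transaction-based redundancy: run the protected
        // section twice speculatively, compare the buffered write sets, commit on match.
        #include <functional>
        #include <iostream>
        #include <map>

        using WriteSet = std::map<int, int>;   // address -> buffered value (lazy versioning)

        // Run the section speculatively: writes go into a buffer, not to memory.
        WriteSet speculative_run(const std::function<void(WriteSet&)>& section) {
            WriteSet ws;
            section(ws);
            return ws;
        }

        int main() {
            auto section = [](WriteSet& ws) { ws[0x100] = 7; ws[0x104] = 9; };

            WriteSet first  = speculative_run(section);
            WriteSet second = speculative_run(section);   // redundant execution

            if (first == second)
                std::cout << "write sets match: commit transaction\n";
            else
                std::cout << "mismatch: abort and re-execute (rollback is free)\n";
        }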

    Biosorption of Cr(VI) by free and immobilized Pediastrum boryanum biomass: equilibrium, kinetic, and thermodynamic studies

    15th International Symposium on Toxicity Assessment (ISTA), 3-8 July 2011, City University of Hong Kong, Hong Kong, People's Republic of China. WOS: 000306790200053. PubMed ID: 22374187. The biosorption of Cr(VI) from aqueous solution has been studied using free and immobilized Pediastrum boryanum cells in a batch system. The algal cells were immobilized in alginate and alginate-gelatin beads via entrapment, and their algal-cell-free counterparts were used as control systems during the biosorption studies of Cr(VI). The changes in the functional groups of the biosorbent formulations were confirmed by Fourier transform infrared spectra. The effects of pH, equilibrium time, initial metal ion concentration, and temperature on the biosorption of Cr(VI) ions were investigated. The maximum Cr(VI) biosorption capacities were found to be 17.3, 6.73, 14.0, 23.8, and 29.6 mg/g for the free algal cells, alginate, alginate-gelatin, alginate-cells, and alginate-gelatin-cells, respectively, at pH 2.0 and an initial Cr(VI) concentration of 400 mg/L. The biosorption of Cr(VI) on all the tested biosorbents (P. boryanum cells, alginate, alginate-gelatin, alginate-cells, and alginate-gelatin-cells) followed the Langmuir adsorption isotherm model. The thermodynamic studies indicated that the biosorption process was spontaneous and endothermic in nature under the studied conditions. For all the tested biosorbents, the biosorption kinetics were best described by the pseudo-second-order model. Funding: PROCORE-France/Hong Kong Joint Research Scheme, Croucher Foundation, K.C. Wong Education Foundation.
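
    For reference, the two models named in the abstract have the standard textbook forms below (q_e: equilibrium uptake, C_e: equilibrium concentration, q_max and K_L: Langmuir capacity and constant, q_t: uptake at time t, k_2: pseudo-second-order rate constant); no fitted parameter values from the study are implied:

        % Langmuir adsorption isotherm
        \[ q_e = \frac{q_{\max} K_L C_e}{1 + K_L C_e} \]
        % Pseudo-second-order kinetic model (linearized form)
        \[ \frac{t}{q_t} = \frac{1}{k_2 q_e^{2}} + \frac{t}{q_e} \]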